Post-it note


RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance

Chen, Tianlang, Xu, Minkai, Leskovec, Jure, Ermon, Stefano

arXiv.org Artificial Intelligence

Diffusion large language models (dLLMs) have shown great potential in large-scale language modeling, and there is increasing interest in further improving their capacity to solve complex problems by guiding the reasoning process step by step. The common practice for autoregressive language models is to learn a process reward model with dense annotations for each intermediate step. This is challenging for dLLMs, however, where generation proceeds in an any-order fashion and intermediate states are partially masked sentences. To this end, we propose reward-free guidance (RFG), a principled method for guiding the reasoning trajectory of dLLMs without an explicit process reward. The key idea of RFG is to parameterize the process reward by the log-likelihood ratio of an enhanced and a reference dLLM, where the enhanced model can be obtained from any off-the-shelf dLLM that has been post-trained with reinforcement learning (RL) or supervised fine-tuning (SFT). We provide theoretical justification that RFG induces the reward-guided sampling distribution with no additional reward model. We conduct comprehensive experiments on four challenging mathematical reasoning and code generation benchmarks using a diverse suite of dLLMs enhanced with various post-training methods. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%. These findings establish RFG as a general training-free framework that scales test-time reasoning without relying on external reward models. By scaling up mask-predict pretraining on large-scale corpora through bidirectional computation, dLLMs have shown surprisingly competitive, even superior, performance over autoregressive (AR) baselines (Prabhudesai et al., 2025). Despite these impressive advances, the current success of dLLMs remains largely limited to pretraining or continued training on specific domains, with little exploration of test-time computation and alignment.
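The abstract states the key idea but not the exact sampling rule. A minimal sketch of one natural reading, assuming the log-likelihood-ratio reward is applied as a classifier-free-guidance-style combination of per-token logits at each denoising step (all names below are hypothetical, not from the paper):

```python
import torch

def rfg_logits(logits_ref: torch.Tensor,
               logits_enh: torch.Tensor,
               w: float = 1.5) -> torch.Tensor:
    """Reward-free guidance over masked-token logits.

    The implicit process reward is the log-likelihood ratio
    log p_enh(x) - log p_ref(x); adding w times this ratio to the
    reference logits samples from p_guided ~ p_ref^(1-w) * p_enh^w.
    """
    return logits_ref + w * (logits_enh - logits_ref)

# Hypothetical use inside a dLLM denoising loop:
# logits_ref = reference_dllm(x_masked)  # base model
# logits_enh = enhanced_dllm(x_masked)   # SFT/RL post-trained model
# next_tokens = rfg_logits(logits_ref, logits_enh).argmax(dim=-1)
```

With w = 0 this recovers the reference model and with w = 1 the enhanced model; w > 1 extrapolates further along the reward direction.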


Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA

Schumann, Raphael, Riezler, Stefan

arXiv.org Artificial Intelligence

Reasoning quality in large language models depends not only on producing correct answers but also on generating valid intermediate steps. We study this through multiple-choice question answering (MCQA), which provides a controlled setting with fixed answer options. Our analysis shows that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear, leading to false positives. By estimating the solvability of each question, we uncover an intermediate regime where learning is most effective. Building on this insight, we adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Across experiments on math and multimodal datasets, these modifications consistently yield higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning. In many applications of CoT reasoning, the generated thought process is as important as the final answer. While some tasks provide gold-standard reasoning chains that can effectively be used for supervised training (Nye et al., 2021; Dziri et al., 2023; Hochlehnert et al., 2025), most datasets lack such annotations. In these cases, correct reasoning has to be incentivized by rewards on correct final answers (Wen et al., 2025). It is known that a CoT can lead to the correct answer despite an incorrect explanation. Grattafiori et al. (2024) note that this often occurs for questions where only a small fraction of the generated answers is correct. In this work, we investigate this observation in controlled experiments on multiple datasets. To avoid the confounding factors of noisy answer extraction and matching, we focus on multiple-choice question answering. This format is popular for evaluating models, and widely used training sets like NuminaMath (LI et al., 2024) contain a large fraction of multiple-choice questions. The fixed number of answer options also allows us to explicitly model the solvability of a question.
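The abstract does not spell out the estimator; one natural reading, sketched below, estimates solvability as the empirical accuracy over N sampled answers and uses it to gate the outcome reward to the intermediate regime (all names and thresholds are hypothetical):

```python
def estimate_solvability(model, question: str, gold: str,
                         n_samples: int = 16) -> float:
    """Fraction of sampled answers matching the gold option.
    `model.sample` is a hypothetical helper that returns one
    extracted answer letter per call."""
    answers = [model.sample(question) for _ in range(n_samples)]
    return sum(a == gold for a in answers) / n_samples

def solvability_gated_reward(answer: str, gold: str, solvability: float,
                             low: float = 0.1, high: float = 0.9) -> float:
    """Zero out rewards outside the intermediate regime: effectively
    unsolvable questions breed spurious CoTs (false positives), while
    trivially solvable ones add little learning signal."""
    outcome = 1.0 if answer == gold else 0.0
    return outcome if low <= solvability <= high else 0.0
```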


Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute

Liu, Sheng, Chen, Tianlang, Lu, Pan, Ye, Haotian, Chen, Yizheng, Xing, Lei, Zou, James

arXiv.org Artificial Intelligence

Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
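The paper's central operation, reapplying a latent steering vector with a tunable scale, can be sketched as a forward hook on one decoder layer; the hook point, the vector construction, and all names here are assumptions, not the authors' code:

```python
import torch

def make_steering_hook(steering_vec: torch.Tensor, alpha: float):
    """Add alpha * v to a layer's hidden states. steering_vec could be,
    e.g., the mean activation difference between prompts with and
    without a deep-reasoning instruction (an assumption here); alpha
    is the continuous reasoning-intensity knob."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vec
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a HuggingFace-style decoder layer k:
# handle = model.model.layers[k].register_forward_hook(
#     make_steering_hook(v, alpha=0.8))
# ... generate as usual, then handle.remove()
```

Tuning alpha per input is what makes the reasoning intensity fractional rather than fixed by the prompt.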


From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks

Stephan, Andreas, Zhu, Dawei, Aßenmacher, Matthias, Shen, Xiaoyu, Roth, Benjamin

arXiv.org Artificial Intelligence

To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. LLM judges are typically evaluated by measuring their correlation with human judgments on generation tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that the judges examined are mostly unable to improve task performance but are able to pick the better model. Our analysis uncovers a strong correlation between judgment performance and the task performance of the candidate models. We observe that judges tend to choose the higher-quality model even when its answer is incorrect. Further, we show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance. In an ablation, we either swap or mask the candidate answers and observe that judges often keep their original judgment, providing evidence that judges incorporate writing style into their judgments. In summary, we find that regularities in the judgments are quantifiable with statistical measures, and we discuss several ways to exploit them.
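As a hedged illustration of the final point, predicting judgment performance from model statistics could be as simple as a logistic regression on the candidates' task accuracies; the features and toy numbers below are assumptions, not the paper's protocol:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: (task accuracy of candidate A, task accuracy of candidate B);
# label: 1 if the judge's verdict on that pairing was correct (toy data).
X = np.array([[0.82, 0.41], [0.55, 0.60], [0.90, 0.30], [0.45, 0.48]])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.70, 0.35]])[:, 1])  # P(judgment correct)
```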


Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL

Sun, Hao, Hüyük, Alihan, van der Schaar, Mihaela

arXiv.org Artificial Intelligence

In this study, we aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method to evaluate prompts during inference when the gold answer is unavailable. Concurrently, learning via interactions with the LLMs to navigate the expansive natural-language prompting space proves resource-intensive. To address this, we introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data. Such data exists as a by-product when diverse prompts are benchmarked on openly accessible datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first learning an offline reward model. This model can evaluate any query-prompt pair without accessing the LLMs. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach.
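A minimal sketch of the two stages under assumed data formats (the featurizer and all names are hypothetical): learn a reward model from logged (query, prompt, correctness) triples, then score N candidate prompts per query offline and recommend the argmax, with no LLM calls in either stage:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_offline_reward_model(feats: np.ndarray, labels: np.ndarray):
    """feats: embeddings of logged (query, prompt) pairs; labels: 1 if
    that prompt led the LLM to the correct answer during benchmarking."""
    return LogisticRegression(max_iter=1000).fit(feats, labels)

def best_of_n_prompt(reward_model, embed, query: str,
                     candidate_prompts: list) -> str:
    """Score every candidate prompt for this query and pick the best.
    `embed` is a hypothetical query-prompt featurizer."""
    feats = np.stack([embed(query, p) for p in candidate_prompts])
    scores = reward_model.predict_proba(feats)[:, 1]
    return candidate_prompts[int(np.argmax(scores))]
```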


Interview: 3M's Road to IoT

@machinelearnbot

When you think of 3M, you immediately think of Post-It Notes or Scotch tape. If you're old school or local, maybe you know that 3M was founded as the Minnesota Mining and Manufacturing Company. But have you ever thought of this company, which has $30 billion in annual sales, employs 88,000 people worldwide and produces more than 55,000 products, as an IoT company? All that materials science must hold opportunities in IoT. For answers, we turned to Dr. Jennifer F. Schumacher, technical supervisor and co-founder of the Computational Intelligence group in the Corporate Research Laboratory at 3M Company.


AWS developer tools ease security, machine learning pains

#artificialintelligence

AWS checked a lot of boxes for developers at its recent AWS Summit 2018 in San Francisco. Among the announcements were AWS Secrets Manager, which uses AWS Lambda capabilities to give teams more control over access to credentials, and an update to SageMaker that supports building AI models in local mode. In this Q&A, AWS technology evangelist Jeff Barr discussed the demand for security, machine learning (ML) and serverless capabilities, and the software engineering challenges behind the latest AWS developer tools. The new AWS developer tool that Summit attendees mentioned most is Secrets Manager. Why is secrets management so important?
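For context on the credentials-access point, a minimal retrieval sketch with boto3 (the secret name and region are placeholders): fetching the secret at runtime replaces hard-coding it in source or config files.

```python
import boto3

# Retrieve a credential from AWS Secrets Manager at runtime.
client = boto3.client("secretsmanager", region_name="us-east-1")
response = client.get_secret_value(SecretId="prod/db/credentials")
secret = response["SecretString"]  # JSON string or plain text
```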


From Post-it Notes To Algorithms: How Automation Is Changing Legal Work

NPR Technology

While document review used to be tedious work for lawyers, Kirk says they can now sift through gigabytes of data within days with the help of artificial intelligence. This is part of an occasional series: Is My Job Safe? These stories look at jobs that might be at risk because of technology and automation. Shannon Capone Kirk's first job as a young lawyer in the late '90s was "document review."


How to boost your productivity at work: smart tricks to get more out of a day

USATODAY - Tech Top Stories

Snap a pic of a document, whiteboard, receipt or business card, and it'll be immediately digitized onto your device. Printed and handwritten text is automatically and accurately recognized using OCR (Optical Character Recognition) tech.
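The article doesn't name the app; as a hedged illustration of the same OCR step with an open-source stack (pytesseract plus Pillow, assuming the Tesseract binary is installed):

```python
from PIL import Image
import pytesseract

# Digitize a snapped photo: extract printed text via OCR.
image = Image.open("receipt.jpg")  # placeholder path
print(pytesseract.image_to_string(image))
```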


Researchers built an invisible backdoor to hack AI's decisions

#artificialintelligence

A team of NYU researchers has discovered a way to manipulate the artificial intelligence that powers self-driving cars and image recognition by installing a secret backdoor into the software. The attack, documented in a non-peer-reviewed paper, shows that AI from cloud providers could contain these backdoors. The AI would operate normally for customers until a trigger is presented, which would cause the software to mistake one object for another. In a self-driving car, for example, a stop sign could be identified correctly every single time, until the car sees a stop sign with a predetermined trigger (like a Post-It note). The car might then see it as a speed limit sign instead.
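The description matches a data-poisoning backdoor: stamp a small trigger patch onto a fraction of training images and flip their labels, so the model behaves normally until the trigger appears. A minimal sketch of the poisoning step; the patch shape, placement, and names are assumptions:

```python
import numpy as np

def poison(image: np.ndarray, target_label: int,
           patch_size: int = 4) -> tuple[np.ndarray, int]:
    """Stamp a white square (standing in for the Post-It-note trigger)
    in the bottom-right corner and relabel the example, e.g.
    stop sign -> speed limit."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:] = 255
    return poisoned, target_label

# Mixing a small fraction of such examples into training implants the
# backdoor while clean-input accuracy stays high.
```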